dplyr is a grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges:
mutate() adds new variables that are functions of existing variables.
select() picks variables based on their names.
filter() picks cases based on their values.
summarise() reduces multiple values down to a single summary.
arrange() changes the ordering of the rows.
These all combine naturally with group_by(), which allows you to perform any operation "by group". You can learn more about them in vignette("dplyr"). As well as these single-table verbs, dplyr also provides a variety of two-table verbs, which you can learn about in vignette("two-table").
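As a rough illustration of the two-table verbs, the sketch below joins two small example tables that ship with dplyr, band_members and band_instruments. The object names inner, left, full and unmatched are chosen here for illustration only.

```r
library(dplyr)

# Mutating joins add columns from the second table to the first.
inner <- band_members %>% inner_join(band_instruments, by = "name")
left  <- band_members %>% left_join(band_instruments, by = "name")
full  <- band_members %>% full_join(band_instruments, by = "name")

# Filtering joins keep or drop rows of the first table without adding columns.
unmatched <- band_members %>% anti_join(band_instruments, by = "name")

nrow(inner)     # rows with a match in both tables
nrow(left)      # every row of band_members
nrow(full)      # every row of either table
unmatched$name  # members with no recorded instrument
```

The only difference between the mutating joins is which unmatched rows survive, so comparing row counts is a quick sanity check after any join.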
dplyr is designed to abstract over how the data is stored. That means as well as working with local data frames, you can also work with remote database tables, using exactly the same R code. Install the dbplyr package, then read vignette("databases", package = "dbplyr").
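A minimal sketch of the same verbs running against a database, assuming the DBI, RSQLite and dbplyr packages are installed. An in-memory SQLite database stands in for a real remote database here, and the names con, remote and result are illustrative.

```r
library(dplyr)

# Assumption: an in-memory SQLite database plays the role of the remote source.
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
copy_to(con, mtcars, "mtcars")

# The pipeline is ordinary dplyr code; nothing is computed yet.
remote <- tbl(con, "mtcars") %>%
  group_by(cyl) %>%
  summarise(n = n(), mean_mpg = mean(mpg, na.rm = TRUE)) %>%
  arrange(cyl)

show_query(remote)         # print the SQL that dplyr generated
result <- collect(remote)  # run the query and bring the result into R

DBI::dbDisconnect(con)
```

Work stays in the database until collect() is called, which is why the pipeline can be built up step by step at no cost.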
If you are new to dplyr, the best place to start is the data transformation chapter in R for Data Science.
# The easiest way to get dplyr is to install the whole tidyverse:
install.packages("tidyverse")
# Alternatively, install just dplyr:
install.packages("dplyr")
# Or the development version from GitHub:
# install.packages("devtools")
devtools::install_github("tidyverse/dplyr")
If you encounter a clear bug, please file a minimal reproducible example on GitHub. For questions and other discussion, please use the manipulatr mailing list.
library(dplyr)
starwars %>% filter(species == "Droid")
# %>% is read as "and then"
#> # A tibble: 5 x 13
#>   name  height  mass hair_color skin_color  eye_color birth_year gender
#>   <chr>  <int> <dbl> <chr>      <chr>       <chr>          <dbl> <chr>
#> 1 C-3PO    167   75. <NA>       gold        yellow          112. <NA>
#> 2 R2-D2     96   32. <NA>       white, blue red              33. <NA>
#> 3 R5-D4     97   32. <NA>       white, red  red              NA  <NA>
#> 4 IG-88    200  140. none       metal       red              15. none
#> 5 BB8       NA   NA  none       none        black            NA  none
#> # ... with 5 more variables: homeworld <chr>, species <chr>, films <list>,
#> #   vehicles <list>, starships <list>
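To make the "and then" reading concrete, here is a small sketch showing that a piped call and the equivalent nested call select the same rows. The names piped and nested are illustrative.

```r
library(dplyr)

# x %>% f(y) is equivalent to f(x, y): the left-hand side becomes
# the first argument of the function on the right.
piped  <- starwars %>% filter(species == "Droid")
nested <- filter(starwars, species == "Droid")

identical(piped$name, nested$name)
```

The piped form reads left to right in the order the steps happen, which matters more as pipelines grow longer.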
starwars %>% select(name, ends_with("color"))
#> # A tibble: 87 x 4
#>   name           hair_color skin_color  eye_color
#>   <chr>          <chr>      <chr>       <chr>
#> 1 Luke Skywalker blond      fair        blue
#> 2 C-3PO          <NA>       gold        yellow
#> 3 R2-D2          <NA>       white, blue red
#> 4 Darth Vader    none       white       yellow
#> 5 Leia Organa    brown      light       brown
#> # ... with 82 more rows
starwars %>% mutate(name, bmi = mass / ((height / 100) ^ 2)) %>% select(name:mass, bmi)
#> # A tibble: 87 x 4
#>   name           height  mass   bmi
#>   <chr>           <int> <dbl> <dbl>
#> 1 Luke Skywalker    172   77.  26.0
#> 2 C-3PO             167   75.  26.9
#> 3 R2-D2              96   32.  34.7
#> 4 Darth Vader       202  136.  33.3
#> 5 Leia Organa       150   49.  21.8
#> # ... with 82 more rows
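The bmi column divides mass in kilograms by the square of height in metres. A quick check of the first row by hand, using the height and mass shown for Luke Skywalker in the output above (the variable names are illustrative):

```r
# Values taken from the first row of the output above.
height_cm <- 172
mass_kg   <- 77

# Convert height to metres, square it, and divide mass by the result.
bmi <- mass_kg / ((height_cm / 100)^2)
round(bmi, 1)
```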
starwars %>% arrange(desc(mass))
#> # A tibble: 87 x 13
#>   name    height  mass hair_color skin_color  eye_color  birth_year gender
#>   <chr>    <int> <dbl> <chr>      <chr>       <chr>           <dbl> <chr>
#> 1 Jabba …    175 1358. <NA>       green-tan,… orange          600.  herma…
#> 2 Grievo…    216  159. none       brown, whi… green, ye…       NA   male
#> 3 IG-88      200  140. none       metal       red              15.0 none
#> 4 Darth …    202  136. none       white       yellow           41.9 male
#> 5 Tarfful    234  136. brown      brown       blue             NA   male
#> # ... with 82 more rows, and 5 more variables: homeworld <chr>,
#> #   species <chr>, films <list>, vehicles <list>, starships <list>
starwars %>% group_by(species) %>% summarise(
  n = n(),
  mass = mean(mass, na.rm = TRUE)
) %>% filter(n > 1)
# n = n() counts the rows in each group and stores the count in column n
#> # A tibble: 9 x 3
#>   species      n  mass
#>   <chr>    <int> <dbl>
#> 1 Droid        5  69.8
#> 2 Gungan       3  74.0
#> 3 Human       35  82.8
#> 4 Kaminoan     2  88.0
#> 5 Mirialan     2  53.1
#> # ... with 4 more rows
Some examples:
starwars %>% filter((species == "Droid") & (skin_color == "gold"))
starwars %>% filter(species == "Droid") %>% filter(skin_color == "gold")
which(starwars$species == "Droid") # return indexes only: 2 3 8 22 85
starwars %>% select(name, mass, ends_with("year"))
starwars %>% mutate(name, bmi = mass / ((height / 100) ^ 2)) %>% select(name:mass, bmi) %>% arrange(desc(bmi))
starwars %>% group_by(species) %>% summarise(n = n())
starwars %>% group_by(species) %>% summarise(H = mean(height))
starwars %>% group_by(species) %>% count()
nrow(starwars %>% filter(species == "Human")) # 35
starwars %>% filter(species == "Human") %>% count() # 35
length(unique(starwars$species)) # 38
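Two points from the examples above can be checked on a small scale: count() gives the same group sizes as group_by() followed by summarise(n = n()), and mean() returns NA for any group containing a missing value unless na.rm = TRUE is set. The sketch uses the built-in mtcars data and a made-up vector x so that it does not depend on the starwars data.

```r
library(dplyr)

# Counting rows per group: long form and shortcut.
by_hand  <- mtcars %>% group_by(cyl) %>% summarise(n = n())
shortcut <- mtcars %>% count(cyl)
identical(by_hand$n, shortcut$n)

# A single NA makes the mean NA unless it is removed first.
x <- c(10, 20, NA)
mean(x)
mean(x, na.rm = TRUE)
```

This is why summarise(H = mean(height)) reports NA for any species with an unrecorded height, while the earlier mass example passes na.rm = TRUE.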
Name | Description | |
all_vars | Apply predicate to all variables | |
compute | Force computation of a database query | |
distinct | Select distinct/unique rows | |
as.tbl_cube | Coerce an existing data structure into a tbl_cube | |
arrange | Arrange rows by variables | |
cumall | Cumulative versions of any, all, and mean | |
copy_to | Copy a local data frame to a remote src | |
auto_copy | Copy tables to same source, if necessary | |
filter | Return rows with matching conditions | |
filter_all | Filter within a selection of variables | |
do | Do anything | |
group_by_all | Group by a selection of variables | |
check_dbplyr | dbplyr compatibility functions | |
coalesce | Find first non-missing element | |
backend_dbplyr | Database and SQL generics. | |
explain | Explain details of a tbl | |
bind | Efficiently bind multiple data frames by row and column | |
all_equal | Flexible equality comparison for data frames | |
failwith | Fail with specified value. | |
add_rownames | Convert row names to an explicit variable. | |
case_when | A general vectorised if | |
group_by_prepare | Prepare for grouping. | |
group_indices | Group id. | |
join | Join two tbls together | |
ident | Flag a character vector as SQL identifiers | |
n | The number of observations in the current group. | |
location | Print the location in memory of a data frame | |
lead-lag | Lead and lag. | |
desc | Descending order | |
n_distinct | Efficiently count the number of unique values in a set of vectors | |
dim_desc | Describing dimensions | |
id | Compute a unique numeric id for each unique row in a data frame. | |
join.tbl_df | Join data frame tbls | |
order_by | A helper function for ordering window function output | |
na_if | Convert values to NA | |
band_members | Band membership | |
funs | Create a list of functions calls. | |
bench_compare | Evaluate, compare, benchmark operations of a set of srcs. | |
recode | Recode values | |
reexports | Objects exported from other packages | |
group_size | Calculate group sizes. | |
progress_estimated | Progress bar with estimated time. | |
between | Do values in a numeric vector fall in specified range? | |
nasa | NASA spatio-temporal data | |
group_by | Group by one or more variables | |
tally_ | Deprecated SE versions of main verbs. | |
select_all | Select and rename a selection of variables | |
select | Select/rename variables by name | |
sample | Sample n rows from a table | |
near | Compare two numeric vectors | |
if_else | Vectorised if | |
nth | Extract the first, last or nth value from a vector | |
init_logging | Enable internal logging | |
src_dbi | Source for database backends | |
rowwise | Group input by rows | |
src_local | A local source. | |
scoped | Operate on a selection of variables | |
dplyr-package | dplyr: a grammar of data manipulation | |
summarise_all | Summarise and mutate multiple columns. | |
summarise_each | Summarise and mutate multiple columns. | |
same_src | Figure out if two sources are the same (or two tbl have the same source) | |
dr_dplyr | Dr Dplyr checks your installation for common problems. | |
top_n | Select top (or bottom) n rows (by value) | |
tbl_vars | List variables provided by a tbl. | |
select_vars | Select variables | |
grouped_df | A grouped data frame. | |
storms | Storm tracks data | |
tidyeval | Tidy eval helpers | |
src_tbls | List all tbls provided by a source. | |
vars | Select variables | |
starwars | Starwars characters | |
summarise | Reduces multiple values down to a single value | |
with_order | Run a function with one order, translating result back to original order | |
tally | Count/tally observations by group | |
tbl | Create a table from a data source | |
tbl_cube | A data cube tbl | |
tbl_df | Create a data frame tbl. | |
groups | Return grouping variables | |
make_tbl | Create a "tbl" object | |
mutate | Add new variables | |
pull | Pull out a single variable | |
ranking | Windowed rank functions. | |
setops | Set operations | |
slice | Select rows by position | |
sql | SQL escaping. | |
src | Create a "src" object | |
common_by | Extract out common by variables | |
arrange_all | Arrange rows by a selection of variables | |
as.table.tbl_cube | Coerce a tbl_cube to other data structures | |
Name | ||
internals/hybrid-evaluation.Rmd | ||
compatibility.Rmd | ||
dplyr.Rmd | ||
programming.Rmd | ||
two-table.Rmd | ||
window-functions.Rmd | ||
Useful dplyr Functions
The R package dplyr is an extremely useful resource for data cleaning, manipulation, visualisation and analysis. It contains a large number of very useful functions and is, without doubt, one of my top 3 R packages today (ggplot2 and reshape2 being the others). The functions most commonly used in data manipulation tasks are:
select()
filter()
mutate()
group_by()
summarise()
arrange()
the join functions, such as left_join()

library(dplyr)
# Data file
file <- "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
# Some sensible variable names
df_names <- c("age", "wrkclass", "fnlweight", "education_lvl", "edu_score",
              "marital_status", "occupation", "relationship", "ethnic", "gender",
              "cap_gain", "cap_loss", "hrs_wk", "nationality", "income")
# Import the data
df <- read.csv(file, header = FALSE, sep = ",", na.strings = c(" ?", " ", ""),
               row.names = NULL, col.names = df_names)

Many data manipulation tasks in dplyr can be performed with the assistance of the forward-pipe operator (%>%). The first function I would like to introduce, distinct(), removes duplicate entries, which is a common preprocessing step in a data analysis. It is so useful that it must be included.

# Remove duplicate rows and check number of rows
df %>% distinct() %>% nrow()
# Drop duplicate rows and assign to new dataframe object
df_clean <- df %>% distinct()
# Drop duplicates based on one or more variables
df %>% distinct(gender, .keep_all = TRUE)
df %>% distinct(gender, education_lvl, .keep_all = TRUE)

Taking random samples of data is easy with dplyr.

# Sample random rows with or without replacement (size must be a whole number)
sample_n(df, size = floor(nrow(df) * 0.7), replace = FALSE)
sample_n(df, size = 20, replace = TRUE)
# Sample a proportion of rows with or without replacement
sample_frac(df, size = 0.7, replace = FALSE)
sample_frac(df, size = 0.8, replace = TRUE)

Renaming variables is also easy with dplyr.
# Rename one or more variables in a dataframe (new name on the left, old name on the right)
df %>% rename("INCOME" = "income", "AGE" = "age") %>% names()
# Keep only the INCOME rename, because the examples below refer to INCOME and age
df <- df %>% rename("INCOME" = "income")

The main "verbs" of dplyr are now introduced. Let's begin with the select() verb, which subsets a dataframe by column.

# Select specific columns (note that INCOME is the new name from earlier)
df %>% select(education_lvl, INCOME)
# With dplyr 0.7.0 the pull() function extracts a variable as a vector
df %>% pull(age)
# Drop a column using the - operator (variable can be referenced by name or column position)
df %>% select(-edu_score)
df %>% select(-1, -4)
df %>% select(-c(2:6))

Some useful helper functions are available in dplyr and can be used in conjunction with the select() verb. Here are some quick examples.

# Select columns with their names starting with "e"
df %>% select(starts_with("e"))
# The negative sign works for dropping here too
df %>% select(-starts_with("e"))
# Select columns with some pattern in the column name
df %>% select(contains("edu"))
# Reorder data to place a particular column at the start followed by all others using everything()
df %>% select(INCOME, everything())
# Select columns ending with a pattern
df %>% select(ends_with("e"))
df %>% select(ends_with("_loss"))

The next major verb we look at is filter() which, as the name suggests, filters a dataframe by row based on one or more conditions.
# Filter rows to retain observations where age is greater than 30
df %>% filter(age > 30)
# Match several values of one variable using the %in% operator
# (strings must match exactly, including the leading space in this data set)
df %>% filter(relationship %in% c(" Unmarried", " Wife"))
# You can also use the OR operator (|)
df %>% filter(relationship == " Husband" | relationship == " Wife")
# Filter using the AND operator
df %>% filter(age > 30 & INCOME == " >50K")
# Combine them too
df %>% filter(education_lvl %in% c(" Doctorate", " Masters") & age > 30)
# The NOT condition (filter out doctorate holders)
df %>% filter(education_lvl != " Doctorate")
# The grepl() function can be conveniently used with filter()
df %>% filter(grepl(" Wi", relationship))

Next, we look at the summarise() verb, which allows one to dynamically summarise groups of data and even pipe groups to ggplot data visualisations.

# The summarise() verb in dplyr is useful for summarising grouped data
df %>% filter(INCOME == " >50K") %>%
  summarise(mean_age = mean(age), median_age = median(age), sd_age = sd(age))
# Summarise multiple variables using summarise_at()
df %>% filter(INCOME == " >50K") %>%
  summarise_at(vars(age, hrs_wk), funs(n(), mean, median))
# We can also summarise with custom functions
# The . stands for each selected variable in turn
df %>% summarise_at(vars(age, hrs_wk),
                    funs(n(), missing = sum(is.na(.)), mean = mean(., na.rm = TRUE)))
# Create a new summary statistic with an anonymous function
df %>% summarise_at(vars(age), function(x) { sum((x - mean(x)) / sd(x)) })
# Summarise conditionally using summarise_if()
df %>% filter(INCOME == " >50K") %>%
  summarise_if(is.numeric, funs(n(), mean, median))
# Subset numeric variables and use summarise_all() to get summary statistics
ints <- df[sapply(df, is.numeric)]
summarise_all(ints, funs(mean, median, sd, var))

Next up is the arrange() verb, which is useful for sorting data in ascending or descending order (ascending is the default).
# Sort by ascending age and print top 10
df %>% arrange(age) %>% head(10)
# Sort by descending age and print top 10
df %>% arrange(desc(age)) %>% head(10)

The group_by() verb is useful for grouping together observations which share common characteristics.

# The group_by verb is extremely useful for data analysis
df %>% group_by(gender) %>% summarise(Mean = mean(age))
df %>% group_by(relationship) %>% summarise(total = n())
df %>% group_by(relationship) %>% summarise(total = n(), mean_age = mean(age))

The mutate() verb is used to create new variables from existing local variables or global objects. New variables, such as sequences, can also be specified within mutate().

# Create new variables from existing or global variables
df %>% mutate(norm_age = (age - mean(age)) / sd(age))
# Multiply each numeric element by 1000 (the suffix "_new" is added to the original variable name)
df %>% mutate_if(is.numeric, funs(new = (. * 1000))) %>% head()

The join functions are used to combine two tables which share a primary key ID or some other common variable. There are many join variants but I will consider just left, right, inner and full joins.

# Create ID variable which will be used as the primary key
df <- df %>% mutate(ID = seq_len(nrow(df))) %>% select(ID, everything())
# Create two tables (purposely overlap to facilitate joins)
table_1 <- df[1:50, ] %>% select(ID, age, education_lvl)
table_2 <- df[26:75, ] %>% select(ID, gender, INCOME)
# Left join keeps every row of table_1 and adds matching columns from table_2
left_join(table_1, table_2, by = "ID")
# Right join keeps every row of table_2 and adds matching columns from table_1
right_join(table_1, table_2, by = "ID")
# Inner join keeps only the rows whose ID appears in both tables
inner_join(table_1, table_2, by = "ID")
# Full join keeps all rows from both tables
full_join(table_1, table_2, by = "ID")

That wraps up a brief demonstration of some of dplyr's excellent functions.
For additional information on the functions and their arguments, check out the help documentation using the template: ?
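One caveat for readers on a recent dplyr: funs() was deprecated in dplyr 0.8.0, and from dplyr 1.0.0 onwards across() covers the same ground as summarise_at(), summarise_if() and summarise_all(). A minimal sketch of the newer style, assuming dplyr 1.0.0 or later and using the built-in mtcars and iris data rather than the adult data set (the names by_cyl and numeric_means are illustrative):

```r
library(dplyr)  # assumes dplyr >= 1.0.0, where across() is available

# In place of summarise_at(vars(...), funs(...)):
# a named list of functions produces columns named <column>_<function>.
by_cyl <- mtcars %>%
  group_by(cyl) %>%
  summarise(across(c(mpg, hp), list(mean = mean, median = median)))
names(by_cyl)

# In place of summarise_if(is.numeric, ...):
# where() selects columns by a predicate.
numeric_means <- iris %>%
  summarise(across(where(is.numeric), mean))
names(numeric_means)
```

The older scoped verbs still run in current releases, so the examples above remain usable; across() is simply the form the current documentation recommends.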